
Applications in Computer Vision

6.5.1 Preliminaries

In a specific convolution layer, $w \in \mathbb{R}^{C_{out} \times C_{in} \times K \times K}$, $a_{in} \in \mathbb{R}^{C_{in} \times W_{in} \times H_{in}}$, and $a_{out} \in \mathbb{R}^{C_{out} \times W_{out} \times H_{out}}$ represent its weights and feature maps, where $C_{in}$ and $C_{out}$ denote the numbers of input and output channels, $(H, W)$ are the height and width of the feature maps, and $K$ denotes the kernel size. Then we have the following.

$a_{out} = a_{in} \otimes w$,  (6.78)

where $\otimes$ is the convolution operation. We omit the batch normalization (BN) and activation layers for simplicity. The 1-bit model aims to quantize $w$ and $a_{in}$ into $b_w \in \{-1, +1\}^{C_{out} \times C_{in} \times K \times K}$ and $b_{a_{in}} \in \{-1, +1\}^{C_{in} \times H \times W}$ using efficient XNOR and bit-count operations to replace full-precision operations. Following [48], the forward process of the 1-bit CNN is

$a_{out} = \alpha \circ b_{a_{in}} \circledast b_w$,  (6.79)

where $\circledast$ denotes the XNOR and bit-count operations, and $\circ$ denotes channel-wise multiplication. $\alpha = [\alpha_1, \cdots, \alpha_{C_{out}}] \in \mathbb{R}^{C_{out}}_{+}$ is the vector consisting of channel-wise scale factors. $b = \operatorname{sign}(\cdot)$ denotes the binarized variable obtained via the sign function, which returns $+1$ if the input is greater than zero and $-1$ otherwise. The output then passes through several non-linear layers, e.g., the BN layer, the non-linear activation layer, and the max-pooling layer; we omit these for simplicity. The output $a_{out}$ is then binarized to $b_{a_{out}}$ via the sign function. The fundamental objective of BNNs is to compute $w$ such that it stays as close as possible to its value before binarization, thus minimizing the binarization effect. We then define the reconstruction error as

$L_R(w, \alpha) = \| w - \alpha \circ b_w \|$.  (6.80)
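As a concrete illustration, a well-known closed-form choice for the channel-wise scale factor (used in XNOR-Net-style binarization) is $\alpha_c = \mathbb{E}[|w_c|]$, the mean absolute value of each output channel's weights; this choice minimizes the squared reconstruction error per channel. A minimal NumPy sketch, with a randomly initialized weight tensor standing in for a real layer:

```python
import numpy as np

rng = np.random.default_rng(0)
C_out, C_in, K = 8, 4, 3
w = rng.standard_normal((C_out, C_in, K, K))

# Binarize weights: +1 if the entry is greater than zero, -1 otherwise.
b_w = np.where(w > 0, 1.0, -1.0)

# Channel-wise scale factors: alpha_c = mean(|w_c|), the XNOR-Net
# closed-form minimizer of the per-channel squared reconstruction error.
alpha = np.abs(w).reshape(C_out, -1).mean(axis=1)

# Reconstruction error of Eq. (6.80), measured here as an L2 norm.
w_hat = alpha[:, None, None, None] * b_w
err_scaled = np.linalg.norm(w - w_hat)
err_unscaled = np.linalg.norm(w - b_w)
assert err_scaled < err_unscaled  # scaling reduces the binarization error
```

The assertion checks the point of introducing $\alpha$: rescaling the binary weights channel-wise brings them strictly closer to the full-precision weights than raw $\pm 1$ values.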

6.5.2 Select Proposals with Information Discrepancy

To eliminate the large magnitude difference between the real-valued teacher and the 1-bit student, we introduce a channel-wise transformation for the proposals$^{1}$ of the intermediate neck. We first apply a transformation $\phi(\cdot)$ on a proposal $\tilde{R}_n \in \mathbb{R}^{C \times W \times H}$ and have

$R_{n;c}(x, y) = \phi(\tilde{R}_{n;c}(x, y)) = \dfrac{\exp\big(\tilde{R}_{n;c}(x, y)/T\big)}{\sum_{(x,y)}^{(W,H)} \exp\big(\tilde{R}_{n;c}(x, y)/T\big)}$,  (6.81)

where $(x, y) \in (W, H)$ denotes a specific spatial location $(x, y)$ in the spatial range $(W, H)$, and $c \in \{1, \cdots, C\}$ is the channel index. $n \in \{1, \cdots, N\}$ is the proposal index, and $N$ denotes the number of proposals. $T$ denotes a hyper-parameter controlling the statistical attributes of the channel-wise alignment operation$^{2}$. After the transformation, the features in each channel of a proposal are projected into the same feature space [231] and follow a Gaussian distribution as

$p(R_{n;c}) \sim \mathcal{N}(\mu_{n;c}, \sigma^2_{n;c})$.  (6.82)
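The transformation in Eq. (6.81) is a per-channel softmax with temperature over the spatial locations of a proposal. A small NumPy sketch (the tensor shape and the random proposal are illustrative; $T = 4$ follows this section's setting):

```python
import numpy as np

rng = np.random.default_rng(1)
C, W, H = 16, 7, 7
T = 4.0  # temperature hyper-parameter, as set in this section

R_tilde = rng.standard_normal((C, W, H))  # a proposal feature patch

# Channel-wise softmax with temperature over all (x, y) locations.
z = R_tilde / T
z -= z.max(axis=(1, 2), keepdims=True)  # stabilizer; softmax is unchanged
e = np.exp(z)
R = e / e.sum(axis=(1, 2), keepdims=True)

# Each channel is now a normalized spatial distribution.
assert np.allclose(R.sum(axis=(1, 2)), 1.0)
```

Subtracting the per-channel maximum before exponentiating leaves the softmax output identical but avoids overflow for large activations, which is the standard way to implement Eq. (6.81) in practice.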

We further evaluate the information discrepancy between the teacher and the student proposals. As shown in Fig. 6.16, the teacher and the student have $N_T$ and $N_S$ proposals, respectively. Every proposal in one model generates a counterpart feature map patch at the same location in the other model. Thus, a total of $N_T + N_S$ proposal pairs are considered. To evaluate the information discrepancy, we introduce the Mahalanobis distance of each

$^{1}$In this section, the proposal denotes the neck/backbone feature map patch cropped by the region proposal of detectors.
$^{2}$In this section, we set $T = 4$.
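Since Eq. (6.82) summarizes each transformed proposal channel by Gaussian statistics, a Mahalanobis-style distance between a teacher/student proposal pair can be sketched as below. This is only an illustration under a diagonal-covariance assumption (the teacher's per-channel spatial variance standing in for $\sigma^2_{n;c}$), not the exact formulation of this section:

```python
import numpy as np

def mahalanobis_diag(r_s, r_t, eps=1e-8):
    """Diagonal-covariance Mahalanobis distance between a student
    proposal r_s and a teacher proposal r_t, both shaped (C, W, H).
    The teacher's per-channel spatial variance plays sigma^2_{n;c}."""
    var_t = r_t.var(axis=(1, 2), keepdims=True)
    d2 = ((r_s - r_t) ** 2 / (var_t + eps)).sum()
    return np.sqrt(d2)

rng = np.random.default_rng(2)
r_t = rng.standard_normal((16, 7, 7))              # teacher proposal
r_s = r_t + 0.1 * rng.standard_normal((16, 7, 7))  # well-aligned student
r_u = rng.standard_normal((16, 7, 7))              # unrelated proposal

# A well-aligned pair has a much smaller discrepancy than a random pair.
assert mahalanobis_diag(r_s, r_t) < mahalanobis_diag(r_u, r_t)
```

Normalizing the squared difference by the teacher's variance makes the discrepancy scale-aware: channels with naturally high variance are penalized less for the same absolute deviation.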